
[SPARK-22672][SQL][TEST] Refactor ORC Tests #19882

Closed
wants to merge 5 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 4, 2017

What changes were proposed in this pull request?

Since SPARK-20682, we have two OrcFileFormats. This PR refactors the ORC tests with three principles (with a few exceptions):

  1. Move test suite into sql/core.
  2. Create HiveXXX test suite in sql/hive by reusing sql/core test suite.
  3. OrcTest will provide common helper functions and val orcImp: String.
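The three principles can be sketched roughly as follows (a hedged illustration, not the final code; the trait and mixin names follow the suites listed below, and the exact helper set is an assumption):

```scala
// sql/core: OrcTest holds the shared helpers and the implementation switch.
abstract class OrcTest extends QueryTest with SQLTestUtils {
  // "native" -> new OrcFileFormat in sql/core; "hive" -> Hive built-in one.
  val orcImp: String = "native"
  // common helpers such as withOrcFile / withOrcDataFrame live here
}

// sql/hive: a HiveXXX suite reuses the sql/core suite and only flips the switch.
class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
  override val orcImp: String = "hive"
}
```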

Test Suites

Native OrcFileFormat

  • org.apache.spark.sql.execution.datasources.orc
    • OrcFilterSuite
    • OrcPartitionDiscoverySuite
    • OrcQuerySuite
    • OrcSourceSuite
  • o.a.s.sql.hive.orc
    • OrcHadoopFsRelationSuite

Hive built-in OrcFileFormat

  • o.a.s.sql.hive.orc
    • HiveOrcFilterSuite
    • HiveOrcPartitionDiscoverySuite
    • HiveOrcQuerySuite
    • HiveOrcSourceSuite
    • HiveOrcHadoopFsRelationSuite

Hierarchy

OrcTest
    -> OrcSuite
        -> OrcSourceSuite
    -> OrcQueryTest
        -> OrcQuerySuite
    -> OrcPartitionDiscoveryTest
        -> OrcPartitionDiscoverySuite
    -> OrcFilterSuite

HadoopFsRelationTest
    -> OrcHadoopFsRelationSuite
        -> HiveOrcHadoopFsRelationSuite

Please note the following.

  • Unlike the other test suites, OrcHadoopFsRelationSuite doesn't inherit OrcTest. It lives in sql/hive like ParquetHadoopFsRelationSuite because of its dependencies, and it follows the existing convention of using val dataSourceName: String.
  • The OrcFilterSuites cannot reuse test cases because the filter APIs have different function signatures: Hive 1.2.1 ORC classes versus Apache ORC 1.4.1 classes.

How was this patch tested?

Pass the Jenkins tests with reorganized test suites.

@dongjoon-hyun
Member Author

Hi, @cloud-fan , @gatorsmile , @HyukjinKwon , @viirya .
This is a test case restructure after #19651 .

@SparkQA

SparkQA commented Dec 5, 2017

Test build #84443 has finished for PR 19882 at commit 5f2025a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with SharedSQLContext
  • implicit class IntToBinary(int: Int)
  • case class OrcParData(intField: Int, stringField: String)
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • abstract class OrcPartitionDiscoveryTest extends OrcTest
  • class OrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with SharedSQLContext
  • case class AllDataTypesWithNonPrimitiveType(
  • case class BinaryData(binaryData: Array[Byte])
  • case class Contact(name: String, phone: String)
  • case class Person(name: String, age: Int, contacts: Seq[Contact])
  • abstract class OrcQueryTest extends OrcTest
  • test("Creating case class RDD table")
  • test("save and load case class RDD withNones as orc")
  • class OrcQuerySuite extends OrcQueryTest with SharedSQLContext
  • case class OrcData(intField: Int, stringField: String)
  • abstract class OrcSuite extends OrcTest with BeforeAndAfterAll
  • class OrcSourceSuite extends OrcSuite with SharedSQLContext
  • abstract class OrcTest extends QueryTest with SQLTestUtils
  • class HiveOrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class HiveOrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton
  • class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite

@HyukjinKwon
Member

Whoa big class list. Will take a look soon within tomorrow as well.

@dongjoon-hyun
Member Author

Thank you so much, @HyukjinKwon !

emptyDF.createOrReplaceTempView("empty")

// This creates 1 empty ORC file with Hive ORC SerDe. We are using this trick because
// Spark SQL's ORC data source always avoids writing empty ORC files.
Member

Is this still using Hive ORC SerDe?

Member Author

Thank you for review, @viirya . I'll update tomorrow.

@cloud-fan
Contributor

OrcTest will provide common helper functions and def format: String.

Instead of having def format: String, can we just add beforeAll and afterAll in the test suites to set the ORC_IMPLEMENTATION?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 5, 2017

That would take more lines, wouldn't it? Either way, we need helper functions.
And when we remove the old Hive OrcFileFormat and the conf later, this approach will reduce the change.

@cloud-fan
Contributor

OK, maybe have a def orcImp: String, which can be native or hive. Then we can put the beforeAll and afterAll in OrcTest.

That avoids changing the test code from spark.read.orc to spark.read.format(format).
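The suggestion above could look roughly like this (a hedged sketch; `SQLConf.ORC_IMPLEMENTATION` is the conf entry behind `spark.sql.orc.impl`, and the field names are assumptions, not the merged code):

```scala
abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAll {
  // "native" or "hive"; overridden by the sql/hive suites.
  def orcImp: String = "native"

  private var originalConfORCImplementation: String = _

  protected override def beforeAll(): Unit = {
    super.beforeAll()
    // Remember the original value, then switch the implementation for this suite.
    originalConfORCImplementation = spark.conf.get(SQLConf.ORC_IMPLEMENTATION.key)
    spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, orcImp)
  }

  protected override def afterAll(): Unit = {
    // Restore the original implementation so later suites are unaffected.
    spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, originalConfORCImplementation)
    super.afterAll()
  }
}
```

With this shape, existing calls like `spark.read.orc(path)` pick up the right implementation without being rewritten to `spark.read.format(format)`.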

@dongjoon-hyun
Member Author

Okay. No problem. Thanks, @cloud-fan .

/**
* A test suite that tests Apache ORC filter API based filter pushdown optimization.
*/
class OrcFilterSuite extends OrcTest with SharedSQLContext {
Member

Let HiveOrcFilterSuite extend OrcFilterSuite?

Member Author

Ur, it's impossible because of the reason I mentioned in PR description.

OrcFilterSuite and HiveOrcFilterSuite cannot reuse test cases due to the different function signatures using Hive 1.2.1 ORC classes and Apache ORC 1.4.1 classes.

Member

Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84524 has finished for PR 19882 at commit baec5fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with SharedSQLContext
  • implicit class IntToBinary(int: Int)
  • case class OrcParData(intField: Int, stringField: String)
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • abstract class OrcPartitionDiscoveryTest extends OrcTest
  • case class AllDataTypesWithNonPrimitiveType(
  • case class BinaryData(binaryData: Array[Byte])
  • case class Contact(name: String, phone: String)
  • case class Person(name: String, age: Int, contacts: Seq[Contact])
  • abstract class OrcQueryTest extends OrcTest
  • test("Creating case class RDD table")
  • test("save and load case class RDD withNones as orc")
  • class OrcQuerySuite extends OrcQueryTest with SharedSQLContext
  • case class OrcData(intField: Int, stringField: String)
  • abstract class OrcSuite extends OrcTest with BeforeAndAfterAll
  • class OrcSourceSuite extends OrcSuite with SharedSQLContext
  • abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAll
  • class HiveOrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class HiveOrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton
  • class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite

read
.option(ConfVars.DEFAULTPARTITIONNAME.varname, defaultPartitionName)
spark.read
.option("hive.exec.default.partition.name", defaultPartitionName)
Contributor

the new ORC didn't change these config names?

Member Author

Yes. In fact, Apache ORC doesn't have these params.

import testImplicits._

def orcImp: String = "native"

var originalConfORCImplementation = "native"
Contributor

private var?

Member Author

Yep.


override def orcImp: String = "hive"

test("SPARK-8501: Avoids discovery schema from empty ORC files") {
Contributor

why is this test not in the native ORC test suite?

Member Author

Native ORC solves this bug and has a corresponding test case here.

+  test("Schema discovery on empty ORC files") {
+    // SPARK-8501 is fixed.

Contributor

then why don't we put this test in the base class?

Member Author

This only works in the new OrcFileFormat.

  • The new test case is in OrcQuerySuite for the new OrcFileFormat.
  • The old test case is in HiveOrcQuerySuite for the old OrcFileFormat.
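The native-side counterpart might look roughly like this (a hypothetical sketch; `withTempPath` is a helper these suites already use, and the exact write path and assertion are illustrative, not the merged test):

```scala
// OrcQuerySuite (native OrcFileFormat): SPARK-8501 is fixed, so schema
// discovery works even when the written ORC data contains no rows.
test("Schema discovery on empty ORC files") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a zero-row DataFrame with the native ORC data source.
    spark.range(0).selectExpr("id AS a").write.orc(path)
    // Reading the path back should still recover the written schema.
    assert(spark.read.orc(path).schema.fieldNames.contains("a"))
  }
}
```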


spark.sql(
s"""CREATE TEMPORARY VIEW normal_orc_source
|USING org.apache.spark.sql.hive.orc
Contributor

can this just be USING orc?

Member Author

Sure. It just comes from the original test case.
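With the shorter form, the view definition would read like this (a sketch; the `path` option is illustrative and assumes a `path` variable in scope):

```scala
spark.sql(
  s"""
     |CREATE TEMPORARY VIEW normal_orc_source
     |USING orc
     |OPTIONS (path '$path')
   """.stripMargin)
```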

}
}

test("SPARK-21791 ORC should support column names with dot") {
Member

Would we need this test case for Hive's one too?

Member Author

The old OrcFileFormat fails on this test case.
Do you mean adding an exception-catching test case?

Member

Oh, I overlooked. Sure, that's fine.

Member Author

Thanks!

override def beforeAll(): Unit = {
class OrcSourceSuite extends OrcSuite with SharedSQLContext {

protected override def beforeAll(): Unit = {
Member

Mind if I ask where this is needed?

Member Author

The test cases of OrcSuite assume these tables exist.

import testImplicits._

def orcImp: String = "native"
Member

Maybe val could be used.

Member Author

Yep. Done.

@dongjoon-hyun
Member Author

It's rebased to the master to resolve conflicts. Also, I addressed the comments. Thanks!

Member

@HyukjinKwon HyukjinKwon left a comment

Loosely related though, should we maybe rename org.apache.spark.sql.hive.orc.Orc* -> org.apache.spark.sql.hive.orc.HiveOrc* in the main codes too to distinguish the newer ORC from the old Hive ORC?

/**
* A test suite that tests Apache ORC filter API based filter pushdown optimization.
*/
class OrcFilterSuite extends OrcTest with SharedSQLContext {
Member

Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.


test("filter pushdown - combinations with logical operators") {
withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
// Because `ExpressionTree` is not accessible at Hive 1.2.x, this should be checked
Member

I wrote the original tests here like this, using toString, partly because ExpressionTree (SearchArgument.getExpression) is inaccessible and the string format is easy to read.

Although that tree seems to be available in ORC now, I think it's okay to keep the tests like this since they're easy to read, but let's fix up the comments here. They don't look related to Hive anyway.

Member Author

Yep, I'll remove the comment. For the test case, I agree with you. Also, these string-based tests will stay consistent with the Hive ORC tests for a while.
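The string-based checking under discussion looks roughly like this (a hedged sketch; `checkFilterPredicate` and `withOrcDataFrame` are the suite's own helpers, and the expected string is illustrative rather than the exact ORC output):

```scala
// Instead of walking the (previously inaccessible) ExpressionTree, build the
// SearchArgument for a predicate and compare its toString with the expected
// pushed-down form.
withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
  checkFilterPredicate(
    '_1 < 2 || '_1 > 3,
    """leaf-0 = (LESS_THAN _1 2)
      |leaf-1 = (LESS_THAN_EQUALS _1 3)
      |expr = (or leaf-0 (not leaf-1))""".stripMargin.trim)
}
```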

Member

Yup.

@HyukjinKwon
Member

LGTM BTW.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84581 has finished for PR 19882 at commit fcb2ccb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you so much, @HyukjinKwon !

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 7, 2017

@HyukjinKwon . The PR code and description are updated as follows.

  • Update comments
  • Rename back to the originals

The main reason I used the Hive prefix was naming consistency, but now it's restored.

As you see in the PR description, we need to use HiveOrcHadoopFsRelationSuite
because OrcHadoopFsRelationSuite already exists in the same package.

@HyukjinKwon
Member

HyukjinKwon commented Dec 7, 2017

I actually suggested something similar before:

org.apache.spark.sql.execution.datasources.csv.InferSchema
org.apache.spark.sql.execution.datasources.json.InferSchema

but I remember receiving advice at that time, and it became as below:

org.apache.spark.sql.execution.datasources.csv.CSVInferSchema
org.apache.spark.sql.execution.datasources.json.JsonInferSchema

After rethinking it, I realised this is better.

Likewise, I actually liked the Hive prefix. It was easier to distinguish.

@HyukjinKwon
Member

I meant in #19882 (review), I liked this Hive prefix here so wondered if we could do the same thing for the main codes too. Since this PR only refactors the tests, it's loosely related though.

@dongjoon-hyun
Member Author

Oh. I'll bring it back.

@HyukjinKwon
Member

I am sorry, I should have clarified this up front.

@dongjoon-hyun
Member Author

Definitely, my bad.

For the main code, we can do that later in a separate PR if needed. I'd like this PR to contain tests only.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84589 has finished for PR 19882 at commit 1828571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class OrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class OrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class OrcSourceSuite extends OrcSuite with TestHiveSingleton

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84591 has finished for PR 19882 at commit f56423d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84599 has finished for PR 19882 at commit f56423d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in c1e5688 Dec 7, 2017
@dongjoon-hyun
Member Author

Thank you so much, @cloud-fan , @HyukjinKwon , and @gatorsmile !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22672 branch December 7, 2017 15:20
asfgit pushed a commit that referenced this pull request Dec 9, 2017
## What changes were proposed in this pull request?

During #19882, `conf` was mistakenly used to switch the ORC implementation between `native` and `hive`. To affect `OrcTest` correctly, `spark.conf` should be used.
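The one-line difference can be sketched as follows (hedged; here `conf` is the `SQLConf` reachable from the test base class, while `spark.conf` is the runtime config of the session the suites actually run against — the surrounding field names are assumptions):

```scala
// Before (mistaken): setting SQLConf directly may target a different
// session state than the one the suites use.
conf.setConf(SQLConf.ORC_IMPLEMENTATION, orcImp)

// After: going through the test SparkSession affects OrcTest as intended.
spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, orcImp)
```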

## How was this patch tested?

Pass the tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19931 from dongjoon-hyun/SPARK-22672-2.